A Study of Supervised Spam Detection applied to Eight Months of Personal E-Mail
Abstract
In the last year or two, unwelcome email has grown to the extent that it is inconvenient, annoying and wasteful of computer resources. More significantly, its volume threatens to overwhelm our ability to recognize welcome messages, and hence to destroy our trust in email as a reliable communication medium. An automatic spam filter can mitigate these problems, provided that it acts in a reliable and predictable manner. We evaluate ten spam detection methods embodied in six popular open-source spam filters by applying each method sequentially to all of the e-mail received by one individual (X) from August 2003 through March 2004. These 49,086 messages were originally judged in real time by X. The messages and judgements were recorded and reproduced so as to provide the same evaluation suite for all the methods. Five of the methods are derived from Spamassassin [spamassassin.org 2004], a hybrid system which includes both static spam-detection rules and a Bayesian statistical learning component. The purpose of the intra-Spamassassin comparison is to evaluate the relative contributions of the static and learning components in various configurations. The other methods are all “pure” statistical learning systems, in that they contain essentially no spam-detection rules. We compare these – Bogofilter [Raymond 2004], CRM-114 [Yerazunis 2004a], DSPAM [Zdziarski 2004], SpamBayes [Peters 2004], and Spamprobe [Burton 2002a] – against each other and against the learning component of Spamassassin. Recent evaluations have reported very high accuracies – some higher than 99.9% – for several of these filters [Burton 2002a; Zdziarski 2004; Yerazunis 2004a; Louis 2004; Yerazunis 2004b; Holden 2004]. Following the presentation of our results, we contrast these studies and others [Sahami et al. 1998; Graham 2004; Robinson 2004; Androutsopoulos et al. 2000; Tuttle et al. 2004] with ours.
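The sequential evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the study's harness: it assumes each message is classified first and the recorded judgement is then fed back to the filter, which is one plausible reading of "supervised" that the abstract does not spell out. The names `TokenCountFilter` and `evaluate_sequentially` are hypothetical, and the toy learner stands in for a real filter such as Bogofilter or SpamBayes.

```python
from collections import Counter


class TokenCountFilter:
    """Toy learner (hypothetical): votes by how often tokens were seen in spam vs. ham."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()

    def classify(self, message):
        # Predict spam only when spam evidence strictly outweighs ham evidence.
        tokens = message.lower().split()
        spam_votes = sum(self.spam_tokens[t] for t in tokens)
        ham_votes = sum(self.ham_tokens[t] for t in tokens)
        return spam_votes > ham_votes

    def train(self, message, is_spam):
        # Record the message's tokens under the judged class.
        target = self.spam_tokens if is_spam else self.ham_tokens
        target.update(message.lower().split())


def evaluate_sequentially(spam_filter, stream):
    """Classify each message in arrival order, then train on its recorded judgement."""
    counts = Counter()
    for message, is_spam in stream:
        predicted_spam = spam_filter.classify(message)
        if is_spam:
            counts["spam"] += 1
            if not predicted_spam:
                counts["spam_missed"] += 1
        else:
            counts["ham"] += 1
            if predicted_spam:
                counts["ham_blocked"] += 1
        # Assumed feedback step: the filter learns the true label after predicting.
        spam_filter.train(message, is_spam)
    return counts


# Made-up messages for illustration only; not data from the study.
stream = [
    ("cheap pills now", True),
    ("meeting at noon", False),
    ("cheap pills today", True),
    ("lunch meeting tomorrow", False),
]
counts = evaluate_sequentially(TokenCountFilter(), stream)
print(dict(counts))
```

Replaying the same recorded stream through every filter keeps the comparison controlled: each method sees identical messages in identical order, so differences in the two error counts reflect the methods rather than the data.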
Our study’s objective is to provide a controlled, realistic, statistically meaningful evaluation of several common filters and the methods they embody. While our study is limited to the extent that X’s email is typical, and to the extent that the ten subject methods represent the state of the art, we present our methods and analysis in sufficient detail that they may be reproduced with other email streams and filter implementations. In addition, we have archived our evaluation suite so that we may use it to evaluate future spam detection methods.
Spam filtering may be effected in a number of configurations, which we categorize as: manual, static, unsupervised, and supervised. With manual filtering (see figure 1), the task falls to the email recipient who must examine and classify each message that is received. As mentioned above, this process is tedious and error-prone,
Similar resources
A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization
Spam is an unwanted email that is harmful to communications around the world. Spam leads to a growing problem in a personal email, so it would be essential to detect it. Machine learning is very useful to solve this problem as it shows good results in order to learn all the requisite patterns for classification due to its adaptive existence. Nonetheless, in spam detection, there are a large num...
A New Hybrid Approach of K-Nearest Neighbors Algorithm with Particle Swarm Optimization for E-Mail Spam Detection
Emails are one of the fastest economic communications. Increasing email users has caused the increase of spam in recent years. As we know, spam not only damages user’s profits, time-consuming and bandwidth, but also has become as a risk to efficiency, reliability, and security of a network. Spam developers are always trying to find ways to escape the existing filters therefore new filters to de...
A New Model for Email Spam Detection using Hybrid of Magnetic Optimization Algorithm with Harmony Search Algorithm
Unfortunately, among internet services, users are faced with several unwanted messages that are not even related to their interests and scope, and they contain advertising or even malicious content. Spam email contains a huge collection of infected and malicious advertising emails that harms data destroying and stealing personal information for malicious purposes. In most cases, spam emails con...
Analyzing the Social Structure and Dynamics of E-mail and Spam in Massive Backbone Internet Traffic
E-mail is probably the most popular application on the Internet, with everyday business and personal communications dependent on it. Spam or unsolicited e-mail has been estimated to cost businesses significant amounts of money. However, our understanding of the network-level behavior of legitimate e-mail traffic and how it differs from spam traffic is limited. In this study, we have passively c...
An Effective Model for SMS Spam Detection Using Content-based Features and Averaged Neural Network
In recent years, there has been considerable interest among people to use short message service (SMS) as one of the essential and straightforward communications services on mobile devices. The increased popularity of this service also increased the number of mobile devices attacks such as SMS spam messages. SMS spam messages constitute a real problem to mobile subscribers; this worries telecomm...
Publication date: 2004